In this notebook, we are simulating single cell RNA-seq data for the purpose of testing approaches for differential expression analysis. The following scenario is considered:

This setup could reflect data of a single cell type or single cell data obtained from ex vivo cell lines. Sex specific differences in gene expression could reflect the differences in prevalence or progression for diseaseas such as Parkinson’s.

The frameworks that are used for differential expression analysis are Seurat and edgeR.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
## 
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
## 
##     intersect, t
library(patchwork)
library(ggplot2)
library(edgeR)
## Loading required package: limma
source('simulate_single_cell.R')
set.seed(42)

Simulating a count matrix

First, we load human gene names. These will be used in a randomized fashion.

path_gene_names <- "brain_genes.txt"
list_init_genes <- read.csv(path_gene_names, header=F)$V1

A count matrix is created based on a list of average expression values for each gene. The following code is used:

getAnywhere(init_count_matrix)
## A single object matching 'init_count_matrix' was found
## It was found in the following places
##   .GlobalEnv
## with value
## 
## function (n_samples, cell_tag, list_genes, list_mu_expression, 
##     size = 5) 
## {
##     if (length(list_genes) != length(list_mu_expression)) {
##         stop("Size of `list_genes` does not match size of `list_mu_expression`")
##     }
##     list_mu_expression[list_mu_expression < 0] <- 0
##     cell_labels <- paste0(cell_tag, as.character(seq(1:n_samples)))
##     mat_counts <- t(sapply(list_mu_expression, get_neg_binomial, 
##         n = n_samples, size = 5))
##     rownames(mat_counts) <- list_genes
##     colnames(mat_counts) <- cell_labels
##     return(mat_counts)
## }

This matrix will contain raw integer counts that are equal to or greater than zero. The expression of a gene is modeled as a negative binomial distribution over the cells of a sample. A negative binomial distribution reflects single cell gene expression well given that only fraction of the actual transcripts is detected by single cell sequencing. This leads to a distribution which is skewed towards zero. A helper function for generating negative binomial distributions:

getAnywhere(get_neg_binomial)
## A single object matching 'get_neg_binomial' was found
## It was found in the following places
##   .GlobalEnv
## with value
## 
## function (mu, n, size = 5) 
## {
##     vals <- round(rnbinom(n, size = size, mu = mu), 0)
##     return(vals)
## }

Parameters

To obtain some variation between the expression of genes, we model their average expression as a normal distribution. This could also be substituted by any distribution of choice.

# parameters
n_samples <- 500
mean_expr <- 10
sd_expr <- 2
sd_noise <- 0.1 # up to 10% of noise will be added to mean expression values of the samples 

n_diff_genes = 100
diff_mean_expr <- 50 # for differentially expressed genes, we use higher mean expression and variance
diff_sd_expr <- 3

# to reduce the amount of resources, 2000 gene names are picked randomly
n_selected_genes <- 2000
list_genes <- sample(list_init_genes, n_selected_genes, replace=FALSE) 
n_genes <- length(list_genes)

# pseudobulks will be used for edgeR analysis
n_pseudobulks <- 5

# intial mean expression of the genes
expression_mu <- rnorm(length(list_genes), mean=mean_expr, sd=sd_expr)

The expression of genes is modeled seperately for each sample but based on the same mean expression values. To add variation between the samples, up to 10% of noise is added to the mean expression of each gene.

Create a count matrix for female control data

tag <- 'ctrl_f_'

noise <- rnorm(n_genes, mean=0, sd=sd_noise)
expression_mu_tmp <- expression_mu + expression_mu * noise
counts_ctrl_f <- init_count_matrix(n_samples, cell_tag=tag, list_genes=list_genes, 
                                   list_mu_expression=expression_mu_tmp)

Create a count matrix for female disease data

The female disease data will not show characteristic differentially expressed genes to increase the complexity of the data.

tag <- 'pat_f_'

noise <- rnorm(n_genes, mean=0, sd=sd_noise)
expression_mu_tmp <- expression_mu + expression_mu * noise

counts_pat_f <- init_count_matrix(n_samples, cell_tag=tag, list_genes=list_genes, 
                                   list_mu_expression=expression_mu_tmp)

Create a count matrix for male control data

tag <- 'ctrl_m_'

noise <- rnorm(n_genes, mean=0, sd=sd_noise)
expression_mu_tmp <- expression_mu + expression_mu * noise

counts_ctrl_m <- init_count_matrix(n_samples, cell_tag=tag, list_genes=list_genes, 
                                   list_mu_expression=expression_mu_tmp)

Creating a count matrix for male disease data

For the male disease sample, we pick 100 genes at random and model their expression based on a higher mean expression and variance

# change distribution for selected genes
tag <- 'pat_m_'

noise <- rnorm(n_genes, mean=0, sd=sd_noise)
expression_mu_tmp <- expression_mu + expression_mu * noise

idx_diff_genes <- sample(seq(1:n_genes), n_diff_genes, replace=FALSE)

expression_mu_tmp[idx_diff_genes] <- rnorm(n_diff_genes, mean=diff_mean_expr, sd=diff_sd_expr)

counts_pat_m <- init_count_matrix(n_samples, cell_tag=tag, list_genes=list_genes, 
                                   list_mu_expression=expression_mu_tmp)

The count matrices of the four samples is merged into a single count matrix.

list_counts <- list(counts_ctrl_f, counts_pat_f, counts_ctrl_m, counts_pat_m)
counts <- do.call(cbind, list_counts)
dim(counts)
## [1] 2000 2000

Meta data

AS meta date we assign a sample identifier, the sex, and whether it is a control or disease sample.

sample <- rep("sample-1", n_samples)
sample <- c(sample, rep("sample-2", n_samples) )
sample <- c(sample, rep("sample-3", n_samples) )
sample <- c(sample, rep("sample-4", n_samples) )


sex <- rep("female", n_samples)
sex <- c(sex, rep("female", n_samples) )
sex <- c(sex, rep("male", n_samples) )
sex <- c(sex, rep("male", n_samples) )

pathology <- rep("control", n_samples)
pathology <- c(pathology, rep("disease", n_samples))
pathology <- c(pathology, rep("control", n_samples))
pathology <- c(pathology, rep("disease", n_samples))

edgeR analysis is performed on pseudobulks. To improve the robustness of the inference, we will be creating 5 pseudobulks per sample. We already at this information to the meta data at this stage.

# pseudobulk entries
n_samp_per_pseudo <- n_samples / n_pseudobulks


pseudobulk <- c()
tag <- "pblk"
for(sid in c("sample-1", "sample-2", "sample-3", "sample-4")) {
  
  temp_pseudobulks <- c()
  for(i in 1:n_pseudobulks) {
    temp_pseudobulks <- c(temp_pseudobulks, rep(paste(tag, as.character(i), 
                                                      sep="-"), n_samp_per_pseudo))
  }
  pseudobulk <- c(pseudobulk, sample(temp_pseudobulks, length(temp_pseudobulks), 
                                     replace=FALSE))
}

Meta data entries are simply compiled in a data frame.

df_meta <- data.frame(sample, sex, pathology, pseudobulk)
rownames(df_meta) <- colnames(counts)
df_meta

Seurat analysis

Seurat will be used to investigate the structure of our data ….

First, we create a Seurat object from the count data.

# create Seurat object
seurat_obj <- CreateSeuratObject(counts = counts, project = "simulated_data", min.cells = 3, min.features = 200)
## Warning: Data is of class matrix. Coercing to dgCMatrix.
seurat_obj
## An object of class Seurat 
## 2000 features across 2000 samples within 1 assay 
## Active assay: RNA (2000 features, 0 variable features)
##  1 layer present: counts

Second, we add meta data.

seurat_obj <- AddMetaData(object=seurat_obj, metadata=df_meta)

Third, we split the samples into different layers. This will allows as to integrate samples downstream.

seurat_obj[["RNA"]] <- split(seurat_obj[["RNA"]], f = seurat_obj$sample)
seurat_obj
## An object of class Seurat 
## 2000 features across 2000 samples within 1 assay 
## Active assay: RNA (2000 features, 0 variable features)
##  4 layers present: counts.sample-1, counts.sample-2, counts.sample-3, counts.sample-4

Counts per cells are log-normalized.

seurat_obj <- NormalizeData(seurat_obj, normalization.method = "LogNormalize", scale.factor = 10000)
## Normalizing layer: counts.sample-1
## Normalizing layer: counts.sample-2
## Normalizing layer: counts.sample-3
## Normalizing layer: counts.sample-4

First, we will identify variable genes.

seurat_obj <- FindVariableFeatures(seurat_obj, selection.method = "vst", nfeatures = 200)
## Finding variable features for layer counts.sample-1
## Finding variable features for layer counts.sample-2
## Finding variable features for layer counts.sample-3
## Finding variable features for layer counts.sample-4
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(seurat_obj), 10)

# plot variable features with and without labels
plot <- VariableFeaturePlot(seurat_obj)
plot <- LabelPoints(plot=plot, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot

Dimensionality reduction

To get a better understanding of our data, we perform a dimensional reduction using PCA and compute an embedding of the cells, a UMAP.

all.genes <- rownames(seurat_obj)
seurat_obj <- ScaleData(seurat_obj, features = all.genes)
## Centering and scaling data matrix
seurat_obj <- RunPCA(seurat_obj, features = VariableFeatures(object = seurat_obj))
## PC_ 1 
## Positive:  UGCG, LINC01138, RTN4RL2, GFRA1, BRSK2, JAM2, SRPK1, BBIP1, NCEH1, TMOD2 
##     CS, DTD1, NEDD4, TMEFF2, SF3B3, SHPRH, SMG7, ATP5G3, LOC101929256, ST7L 
##     LOC105373738, ADI1, LOC101927960, EDEM3, NWD2, UQCR11, DIDO1, TTC3, IGSF21, RHOU 
## Negative:  LOC403323, FAIM2, UNC13B, TAF2, TOPORS, ATP6V1B2, TMEM259, SCD, NBPF3, PLEKHA3 
##     CTSF, REEP3, MCC, ARL14EP, CACNA2D1, LINC00639, UPF3A, DDX21, PLAG1, RAB3C 
##     ERBB2IP, VRK3, HID1, ST8SIA4, EIF3H, MARS, DNAJA4, TMEM169, PTS, DCP2 
## PC_ 2 
## Positive:  SHPRH, SMG7, SCD, LOC105379165, ZC3H6, GUSB, SLC38A9, TTC3, GLIPR1, AP3D1 
##     PCDH9-AS2, EPB41L4B, CHRM3, TMEM192, SGCZ, DDX21, CS, LINC00639, FOXG1-AS1, SYNJ2 
##     CACNA2D1, ST7L, ARF5, ATP6V0D1, SLC25A29, WTAP, C4orf47, ZNF862, TBL1X, PLS3 
## Negative:  ZNF790, ACTR3, SNX10, LOC102724782, ZNF440, RAB18, UPF3A, ZNF420, SLC24A3, ATP5G3 
##     PTS, PREX2, LMAN1, ACAP3, CSRNP3, NFE2L2, TMEM169, DIDO1, TMEM259, MCC 
##     HES6, FAM169A, ATAD2B, LOC105376004, LOC403323, GPR89A, SMIM14, BAIAP2, LINC01088, PKP4 
## PC_ 3 
## Positive:  USP33, C4orf47, TTC3, TBL1X, KIAA2018, NIPA2, FRZB, PLEKHA3, LOC105376004, SLC24A3 
##     TMEM254, HES6, LOC105376956, DTD1, NFE2L2, COL4A5, PLXDC1, SLC38A9, ATP6V0D1, BRWD3 
##     ZNF440, TMEFF2, PHF7, ST7L, REEP3, SNX10, SYNJ2, SMIM14, ZCCHC14, PCF11 
## Negative:  TMEM192, SYNPO, LCOR, DNAJA4, LOC403323, DLX6-AS1, TOMM40, TMEM259, ACTR3, LOC101929256 
##     IGSF21, EDEM3, TMEM169, ST8SIA4, LMAN1, SLC25A29, PRRT1, CCDC148, TOPORS, SPIDR 
##     LOC101927960, DPYS, WTAP, FAM69B, TAPT1-AS1, BEAN1, AARS, GIPC2, EPB41L4B, PTS 
## PC_ 4 
## Positive:  KCTD8, DPYS, UQCR11, SHPRH, LINC01088, C16orf45, LOC102724782, TOMM40, UCK2, FRZB 
##     EIF3H, ZNF862, TAPT1-AS1, TMEM254, PCF11, MCC, PART1, ARMC10, LOC102724058, ZNF26 
##     ERBB2IP, FAM69B, PTS, BAIAP2, ATP5G3, AP3D1, ZNF250, PLXDC1, BCL2, ZCCHC14 
## Negative:  NADK2, TMEM243, SLC38A9, KIAA2018, CHRM3, RAB18, MTR, PSPC1, NCOA2, EXOC5 
##     LCOR, FLJ39080, GLIPR1, CALM3, IGSF21, LOC105370027, VRK3, TFG, SMG7, ATAD2B 
##     ATP2A2, PLS3, ZNF440, CEP76, NEDD4, HAX1, LOC102724416, LOC105373738, ZDHHC21, ZNF91 
## PC_ 5 
## Positive:  COX14, ATP6V0D1, CTSF, PLXNA4, CCDC148, RWDD1, CELSR2, TRPM4, IMMP1L, LOC101929256 
##     ZNF250, KIAA2018, C16orf45, PART1, TMEM169, FLJ39080, PCDH9-AS2, WTAP, MARS, TCEAL3 
##     PLXDC1, TMEM243, GIPC2, LMAN1, ZNF804A, DIDO1, RNF111, ZNF26, SCD, AARS 
## Negative:  ZCCHC14, C4orf27, BAIAP2, EIF3H, NCOA2, FOXG1-AS1, RHEB, SYNPO, IGSF21, MAP2K4 
##     RAB18, PLEKHA3, TMEM185A, FAIM2, CEP76, UQCR11, FAHD2A, SPIDR, HN1, NBPF3 
##     FAM69B, FAM169A, UBE2J1, ATAD2B, TAF2, ANXA7, SLCO1C1, ST7L, NEDD4, ZDHHC21
VizDimLoadings(seurat_obj, dims = 1:2, reduction = "pca")

DimPlot(seurat_obj, reduction = "pca", group.by='sex')

DimPlot(seurat_obj, reduction = "pca", group.by='pathology')

We can already distinguish the male disease sample in the PCA based on the variance of gene expression, most likely due to the differentially expressed genes.

ElbowPlot(seurat_obj)

seurat_obj <- FindNeighbors(seurat_obj, dims = 1:5)
## Computing nearest neighbor graph
## Computing SNN
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2000
## Number of edges: 62177
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8222
## Number of communities: 4
## Elapsed time: 0 seconds
seurat_obj <- RunUMAP(seurat_obj, dims = 1:5)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 18:16:29 UMAP embedding parameters a = 0.9922 b = 1.112
## 18:16:29 Read 2000 rows and found 5 numeric columns
## 18:16:29 Using Annoy for neighbor search, n_neighbors = 30
## 18:16:29 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 18:16:29 Writing NN index file to temp file /tmp/RtmpMw3WaM/file973a3e480b6f
## 18:16:29 Searching Annoy index using 1 thread, search_k = 3000
## 18:16:30 Annoy recall = 100%
## 18:16:30 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 18:16:30 Initializing from normalized Laplacian + noise (using RSpectra)
## 18:16:30 Commencing optimization for 500 epochs, with 71984 positive edges
## 18:16:30 Using rng type: pcg
## 18:16:32 Optimization finished
DimPlot(seurat_obj, reduction = "umap", group.by = c('sample'))

seurat_obj <- IntegrateLayers(
  object = seurat_obj, method = CCAIntegration,
  orig.reduction = "pca", new.reduction = "integrated.cca",
  verbose = FALSE
)
## Warning: The `slot` argument of `GetAssayData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `slot` argument of `SetAssayData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Similarly to the PCA, the UMAP shows that the male diseae sample exhibits differences to the other three samples.

Sample integration

To verify that our simulated samples not too distinct from one another, we are integrating the samples using canonical correlation analysis (CCA). This will show us whether the samples follow overall a similar distribution of gene expression after considering sample batch effects.

seurat_obj <- FindNeighbors(seurat_obj, reduction = "integrated.cca", dims = 1:5)
## Computing nearest neighbor graph
## Computing SNN
seurat_obj <- FindClusters(seurat_obj, resolution = 2, cluster.name = "cca_clusters")
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2000
## Number of edges: 59115
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.5333
## Number of communities: 19
## Elapsed time: 0 seconds
obj <- RunUMAP(seurat_obj, reduction = "integrated.cca", dims = 1:5, reduction.name = "umap.cca")
## 18:16:42 UMAP embedding parameters a = 0.9922 b = 1.112
## 18:16:42 Read 2000 rows and found 5 numeric columns
## 18:16:42 Using Annoy for neighbor search, n_neighbors = 30
## 18:16:42 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 18:16:42 Writing NN index file to temp file /tmp/RtmpMw3WaM/file973a28c16487
## 18:16:42 Searching Annoy index using 1 thread, search_k = 3000
## 18:16:42 Annoy recall = 100%
## 18:16:43 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 18:16:43 Initializing from normalized Laplacian + noise (using RSpectra)
## 18:16:43 Commencing optimization for 500 epochs, with 74886 positive edges
## 18:16:43 Using rng type: pcg
## 18:16:45 Optimization finished
DimPlot(
  obj,
  reduction = "umap.cca",
  group.by = c("sample", "sex", "pathology"),
  combine = FALSE, label.size = 2
)
## [[1]]

## 
## [[2]]

## 
## [[3]]

We can indeed observe that all samples integrate well despite the differential expression of the 100 genes. This is usually the case for data that was obtained from different samples but similar cell types.

For downstream analysis we are combing the samples back into a single layer.

seurat_obj <- JoinLayers(seurat_obj)

Differntial expression analysis using Wilcoxon signed-rank tests

As single cell RNA-seq data does not typically follow a normal distribution, non-parameteric tests, which have less requirements, are a preferred method for differential expression analysis. Here we are using Wilcoxon signed-rank tests, which are a default method for single cell analysis frameworks like Seurat and SCANPY. Per default P-value are adjusted by a Bonferroni correction for multiple testing.

We first compare the control against disease condition.

Idents(seurat_obj) <- 'pathology'
pathology.de.markers <- FindMarkers(seurat_obj, ident.1 = "disease", ident.2 = "control", test.use = "wilcox", logfc.threshold = 0)
## Warning: `PackageCheck()` was deprecated in SeuratObject 5.0.0.
## ℹ Please use `rlang::check_installed()` instead.
## ℹ The deprecated feature was likely used in the Seurat package.
##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more 
## efficient implementation (no further action necessary).
## This message will be shown once per session
dim(pathology.de.markers)
## [1] 2000    5
 pathology.de.markers[ order(pathology.de.markers$avg_log2FC, decreasing=TRUE),]

To assess how well the Wilcoxon test is detecting the differentially expressed genes, rank the genes by adjusted p-value. This will be compared downstream for the 100 genes that we simulated to be differentially expressed.

is_diff <- numeric(dim(pathology.de.markers)[1])
is_diff[idx_diff_genes] <- 1
names(is_diff) <- list_genes

pathology.de.markers$is_diff <- is_diff[rownames(pathology.de.markers)]
pathology.de.markers
ranked <- rank(pathology.de.markers$p_val_adj, ties='first')

df1 <- data.frame( rank=ranked )
df1$contrast <- "Wilcox - pathology"
df1$is_diff <- pathology.de.markers$is_diff
df1 <- df1[df1$is_diff == 1,]
df1

Next, we identify genes that are differentially expressed between the male disease sample (sample-4).

Idents(seurat_obj) <- 'sample'
sample4.de.markers <- FindMarkers(seurat_obj, ident.1 = "sample-4", ident.2 = NULL, test.use = "wilcox", logfc.threshold = 0)
dim(sample4.de.markers)
## [1] 2000    5
 sample4.de.markers[ order(sample4.de.markers$avg_log2FC, decreasing=TRUE),]
sample4.de.markers$is_diff <- is_diff[rownames(sample4.de.markers)]
sample4.de.markers
ranked <- rank(sample4.de.markers$p_val_adj, ties='first')

df2 <- data.frame( rank=ranked )
df2$contrast <- "Wilcox - sample"
df2$is_diff <- sample4.de.markers$is_diff
df2 <- df2[df2$is_diff == 1,]
df2

We combine the ranks for both contrasts (pathology and sample) in a data frame.

df_wilcox_ranks <- rbind(df1, df2)

Examples of differentially expressed genes

features <- c("UGCG", "RASA3", "HUNK", "EIF2AK2", "POLR3G", "SCML1")
RidgePlot(seurat_obj, features = features, group.by = 'sample', ncol=3)
## Warning: The `slot` argument of `FetchData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
##   Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Picking joint bandwidth of 0.114
## Picking joint bandwidth of 0.119
## Picking joint bandwidth of 0.12
## Picking joint bandwidth of 0.122
## Picking joint bandwidth of 0.117
## Picking joint bandwidth of 0.12

edgeR differentially expression analysis

edgeR was initially developed for bulk RNA-seq. To apply the framework to single cell data, we have to aggregate the cells into pseudobulks. Here we create 5 pseudobulks per sample that consists of each 100 cells.

pseudo_seurat_obj <- AggregateExpression(seurat_obj, assays = 'RNA', return.seurat = T, 
                                         group.by = c('sample', 'sex', 'pathology', 'pseudobulk'))
## Centering and scaling data matrix
dim(pseudo_seurat_obj)
## [1] 2000   20

This results in overall 20 pseudobulks.

Examples of differentially expressed genes after creating pseudobulks

features <- c("UGCG", "RASA3", "HUNK", "EIF2AK2", "POLR3G", "SCML1")
RidgePlot(pseudo_seurat_obj, features = features, group.by = 'sample', ncol=3)
## Picking joint bandwidth of 0.0289
## Picking joint bandwidth of 0.0195
## Picking joint bandwidth of 0.0237
## Picking joint bandwidth of 0.0274
## Picking joint bandwidth of 0.0146
## Picking joint bandwidth of 0.0136

First we create an edgeR object from the pseudobulk counts and normalize the reads.

pblk_counts <- pseudo_seurat_obj@assays$RNA$counts
df_pblk_meta <- pseudo_seurat_obj@meta.data

y <- DGEList(pblk_counts, group = df_pblk_meta$pathology)
y <- normLibSizes(y)

EdgeR fits a generalized linear model (GLM) that allows us to explicitly model different properties of the data. edgeR is a very suitable for sequencing data as it uses a negative binomial model fitting the assumptions about the underlying distribution of gene expression. As a first step, we define a design matrix that allows us to take multiple factors into account such as the pathology state or associated sex of a sample.

pathology <- df_pblk_meta$pathology
sex <- df_pblk_meta$sex
sample <- gsub("-", "_", df_pblk_meta$sample)

design <- model.matrix(~ pathology + sex)
design
##    (Intercept) pathologydisease sexmale
## 1            1                0       0
## 2            1                0       0
## 3            1                0       0
## 4            1                0       0
## 5            1                0       0
## 6            1                1       0
## 7            1                1       0
## 8            1                1       0
## 9            1                1       0
## 10           1                1       0
## 11           1                0       1
## 12           1                0       1
## 13           1                0       1
## 14           1                0       1
## 15           1                0       1
## 16           1                1       1
## 17           1                1       1
## 18           1                1       1
## 19           1                1       1
## 20           1                1       1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$pathology
## [1] "contr.treatment"
## 
## attr(,"contrasts")$sex
## [1] "contr.treatment"

This design matrix is used for fitting the model.

y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

Based on the expression differences, we see that pseudobulks but not samples cluster closely together.

plotMDS(y, col=ifelse(y$samples$group == "disease", "red", "blue"))

By plotting the variation of gene expression against their expression amplitude, we can see the differentially expressed genes (top right).

plotBCV(y)

An advantage of generalized linear models, is that we can consider more complex contrast such as differences based on pathology and sex.

myContrast <- makeContrasts('sexmale+pathologydisease', levels = design)
## Warning in makeContrasts("sexmale+pathologydisease", levels = design): Renaming
## (Intercept) to Intercept
qlf <- glmQLFTest(fit, contrast=myContrast)

res <- topTags(qlf, n = Inf, adjust.method="bonferroni")
res <- res$table
res$is_diff <- is_diff[rownames(res)]
res
ranked <- rank(res$FWER, ties='first')

df1 <- data.frame( rank=ranked )
df1$contrast <- "edgeR - pathology + sex"
df1$is_diff <- res$is_diff
df1 <- df1[df1$is_diff == 1,]

We will compare the combined contrast above to a contrast only using pathology.

myContrast <- makeContrasts('pathologydisease', levels = design)
## Warning in makeContrasts("pathologydisease", levels = design): Renaming
## (Intercept) to Intercept
qlf <- glmQLFTest(fit, contrast=myContrast)

res <- topTags(qlf, n = Inf, adjust.method="bonferroni")
res <- res$table
res$is_diff <- is_diff[rownames(res)]
ranked <- rank(res$FWER, ties='first')

df2 <- data.frame( rank=ranked )
df2$contrast <- "edgeR - pathology"
df2$is_diff <- res$is_diff
df2 <- df2[df2$is_diff == 1,]

As a comparison we will also compute a contrast for the male disease samples against the other samples directly.

pblk_counts <- pseudo_seurat_obj@assays$RNA$counts
df_pblk_meta <- pseudo_seurat_obj@meta.data

y <- DGEList(pblk_counts, group = df_pblk_meta$pathology)
y <- normLibSizes(y)
pathology <- df_pblk_meta$pathology
sex <- df_pblk_meta$sex
sample <- gsub("-", "_", df_pblk_meta$sample)

design <- model.matrix(~ sample)
design
##    (Intercept) samplesample_2 samplesample_3 samplesample_4
## 1            1              0              0              0
## 2            1              0              0              0
## 3            1              0              0              0
## 4            1              0              0              0
## 5            1              0              0              0
## 6            1              1              0              0
## 7            1              1              0              0
## 8            1              1              0              0
## 9            1              1              0              0
## 10           1              1              0              0
## 11           1              0              1              0
## 12           1              0              1              0
## 13           1              0              1              0
## 14           1              0              1              0
## 15           1              0              1              0
## 16           1              0              0              1
## 17           1              0              0              1
## 18           1              0              0              1
## 19           1              0              0              1
## 20           1              0              0              1
## attr(,"assign")
## [1] 0 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$sample
## [1] "contr.treatment"
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
myContrast <- makeContrasts('samplesample_4'
                            , levels = design)
## Warning in makeContrasts("samplesample_4", levels = design): Renaming
## (Intercept) to Intercept
qlf <- glmQLFTest(fit, contrast=myContrast)

res <- topTags(qlf, n = Inf, adjust.method="bonferroni")
res <- res$table
res$is_diff <- is_diff[rownames(res)]
res
ranked <- rank(res$FWER, ties='first')

df3 <- data.frame( rank=ranked )
df3$contrast <- "edgeR - sample"
df3$is_diff <- res$is_diff
df3 <- df3[df3$is_diff == 1,]
df3
df_edgeR_ranks <- rbind(df1, df2, df3)
library(ggplot2)
# Basic violin plot

df_ranks <- rbind(df_edgeR_ranks, df_wilcox_ranks)

p <- ggplot(df_ranks, aes(x=contrast, y=rank, fill=contrast)) + 
  geom_violin(trim=FALSE) +
  geom_dotplot(binaxis='y', stackdir='center', dotsize=0.01) +
  geom_jitter(shape=16, position=position_jitter(0.2))
p
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

The 100 differentially expressed genes have a perfect recall (see below), i.e. the top 100 genes are the genes of interest, when sample-specific contrasts are performed with either edgeR or Wilcoxon tests. As one might expect, based on pathology alone edgeR does not rank the 100 genes well. This improves substantially when both pathology and sex is taking into account. Wilcoxon tests perform suprisingly well on patholohy alone. However, this might also indicate a weakness of the single cell Wilcoxon test as it suggests that half of the cells (from the female disease sample) have little impact on the inference. So, it might pick up effects specific to a sample that are not characteristic for other samples in the same group.

contrasts <- unique(df_ranks$contrast)
max_rank <- c()
for(con in c(contrasts)) {
  max_rank <- c(max_rank, max(df_ranks[ df_ranks$contrast == con,]$rank))
}
df_max_rank <- data.frame(contrast=contrasts, max_rank)
df_max_rank